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We  describe  a  stochastic  virus  evolution  model  representing  genomic  diversification  and  within-host  selection  during  experi¬ 
mental  serial  passages  under  cell  culture  or  live-host  conditions.  The  model  incorporates  realistic  descriptions  of  the  virus  geno¬ 
types  in  nucleotide  and  amino  acid  sequence  spaces,  as  well  as  their  diversification  from  error-prone  replications.  It  quantita¬ 
tively  considers  factors  such  as  target  cell  number,  bottleneck  size,  passage  period,  infection  and  cell  death  rates,  and  the 
replication  rate  of  different  genotypes,  allowing  for  systematic  examinations  of  how  their  changes  affect  the  evolutionary  dy¬ 
namics  of  viruses  during  passages.  The  relative  probability  for  a  viral  population  to  achieve  adaptation  under  a  new  host  envi¬ 
ronment,  quantified  by  the  rate  with  which  a  target  sequence  frequency  rises  above  50%,  was  found  to  be  most  sensitive  to  fac¬ 
tors  related  to  sequence  structure  (distance  from  the  wild  type  to  the  target)  and  selection  strength  (host  cell  number  and 
bottleneck  size).  For  parameter  values  representative  of  RNA  viruses,  the  likelihood  of  observing  adaptations  during  passages 
became  negligible  as  the  required  number  of  mutations  rose  above  two  amino  acid  sites.  We  modeled  the  specific  adaptation 
process  of  influenza  A  H5N 1  viruses  in  mammalian  hosts  by  simulating  the  evolutionary  dynamics  of  H5  strains  under  the  fit¬ 
ness  landscape  inferred  from  multiple  sequence  alignments  of  H3  proteins.  In  light  of  comparisons  with  experimental  findings, 
we  observed  that  the  evolutionary  dynamics  of  adaptation  is  strongly  affected  not  only  by  the  tendency  toward  increasing  fitness 
values  but  also  by  the  accessibility  of  pathways  between  genotypes  constrained  by  the  genetic  code. 


Viruses  with  RNA  genomes  evolve  rapidly,  evading  selective 
pressure  from  the  host  immune  response  and  adapting  to 
changing  environments  (1-4).  In  particular,  their  capacity  to 
switch  host  species,  emerge  into  new  vulnerable  populations,  and 
cause  outbreaks  has  significant  implications  for  viral  disease  con¬ 
trol  (5).  Many  such  outbreaks  in  recent  history  have  been  attrib¬ 
uted  to  viral  species  jumps:  influenza  A  virus  has  jumped  from 
birds  and  pigs  to  humans  multiple  times  (6-9),  the  severe  acute 
respiratory  syndrome  (SARS)  epidemic  was  caused  by  a  species 
jump  of  coronavirus  from  bats  and  palm  civets  to  humans  ( 10— 
13),  and  human  immunodeficiency  virus  type  1  (HIV-1)  is  be¬ 
lieved  to  have  switched  hosts  from  primates  (14). 

Characterizing  the  complex  factors  affecting  the  evolution  of 
viruses  in  natural  settings  among  diverse  groups  of  interacting 
hosts  is  a  challenging  task.  Valuable  insights  have  been  gained  by 
evolutionary  experiments  under  more  controlled  conditions,  par¬ 
ticularly  within  the  context  of  serial-passage  experiments  (15).  In 
a  serial-passage  experiment,  a  cell  culture  or  live  host  is  inoculated 
by  viral  (or  other)  pathogens,  usually  already  well  adapted  to  dif¬ 
ferent  cell  types  or  hosts.  A  pathogen’s  growth  under  the  restric¬ 
tive  host  environment  leads  to  within-host  selection  for  advanta¬ 
geous  variants,  either  present  as  a  minority  in  the  founder 
population  or  generated  from  error-prone  replications.  After  a 
certain  amount  of  time  (approximately  days)  of  such  growths,  a 
small  subset  of  the  resulting  pathogen  population  is  sampled  and 
used  to  inoculate  a  fresh  new  medium  or  host,  initiating  a  subse¬ 
quent  round  of  the  passage.  Generally,  rapid  adaptation  to  the  new 
host  environment  in  the  form  of  increased  fitness  and  virulence  is 
observed  (typically  within  ~10  passages)  along  with  attenuation 
of  adaptation  to  the  former  host  (15).  In  addition  to  revealing  key 
adaptation  strategies  a  pathogen  can  exhibit,  an  important  appli¬ 


cation  of  serial  passages  is  the  production  of  attenuated  vaccine 
strains  capable  of  eliciting  immune  responses  without  virulence 
(16). 

The  recent  availability  of  rapid  and  inexpensive  deep-sequenc¬ 
ing  techniques  (17)  has  the  potential  to  significantly  improve  our 
understanding  of  how  key  factors  affect  virus  evolutionary  dy¬ 
namics,  including  species  jumps,  especially  when  combined  with 
controlled  experiments  such  as  serial  passages.  A  major  obstacle  in 
leveraging  the  growing  sequence  data,  however,  is  the  lack  of 
quantitative  connections  between  such  genotype  data  and  exper¬ 
imentally  measurable  viral  phenotypes,  including  growth  proper¬ 
ties,  infectivity,  virulence,  and  tissue  tropism.  The  main  objective 
of  this  study  was  to  develop,  validate,  and  apply  stochastic  models 
of  viral  evolutionary  dynamics  that  can  bridge  this  gap  by  realis¬ 
tically  modeling  genomic  diversification  and  adaptation  processes 
during  serial  passages. 

Mathematical  models  of  viral  dynamics  have  long  been  used  to 
obtain  insights  into  how  viruses  interact  with  and  adapt  to  differ¬ 
ent  hosts.  The  mainstay  among  such  models  is  the  continuum 
description  of  population  dynamics  formulated  with  ordinary  dif¬ 
ferential  equations  (18-20).  One  of  their  drawbacks,  however,  is 
that  they  are  not  readily  extendable  to  descriptions  of  extensive 
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genomic  diversification  processes.  Lee  et  al.  have  modeled  HIV-1 
intrahost  sequence  evolution  and  divergence  (21).  Lim  et  al.  stud¬ 
ied  the  design  of  growth-attenuated  viruses  using  computational 
models  for  the  intracellular  growth  of  vesicular  stomatitis  viruses 
(22, 23).  Evolutionary  bottleneck  effects  associated  with  serial  pas¬ 
sages  have  been  modeled  using  a  different  mathematical  approach 
(24,  25).  Explicit  descriptions  of  within-host  dynamics  with  mu¬ 
tations  have  been  used  by  Russell  et  al.  (26)  to  examine  the  poten¬ 
tial  for  emergence  of  airborne-transmissible  influenza  A  H5N1 
virus  variants  (6,  27),  but  without  consideration  of  serial  passages 
used  in  the  experiment. 

Recently,  we  introduced  a  stochastic  dynamics-based  model¬ 
ing  framework  of  virus  dynamics  under  immune  selective  pres¬ 
sure  (28).  It  combines  the  classic  quasispecies  theory  description 
of  mutation-selection  balance  (1,  29,  30)  and  the  discrete-state 
stochastic  simulation  algorithm  widely  used  for  chemical  reac¬ 
tions  (31).  The  model  was  successfully  used  to  describe  HIV-1 
within-host  dynamics  in  both  short  (immune  escape  dynamics) 
and  long  (disease  progression)  time  scales.  In  this  study,  we  devel¬ 
oped  an  analogous  stochastic  model  suitable  for  describing  the 
within-host  adaptation  of  viruses  in  serial  passages.  We  demon¬ 
strated  that  the  approach  allows  one  to  quantitatively  probe  the 
effects  of  key  factors  of  evolutionary  dynamics  on  adaptation,  in¬ 
cluding  the  fitness  distribution  in  sequence  space,  bottleneck  size, 
and  mutation  rate.  In  particular,  we  examined  the  adaptation  pro¬ 
cess  by  considering  the  relative  accessibility  of  genotypes  well 
adapted  to  a  new  host  environment  from  the  starting  stock  during 
passages.  We  found  that  the  number  of  passage  rounds  necessary 
to  achieve  this  adaptation  increases  exponentially  with  the  re¬ 
quired  number  of  amino  acid  mutations,  rendering  triple  mutants 
practically  inaccessible.  This  observation  is  consistent  with  exper¬ 
imental  studies  of  the  adaptation  of  influenza  A  H5N 1  viruses  (6) 
and  SARS  coronavirus  (CoV)  (12). 

In  particular,  for  influenza  A  H5N 1  viruses,  which  are  nor¬ 
mally  restricted  to  avian  hosts  but  can  occasionally  spill  over  into 
mammals,  specific  hemagglutinin  (HA)  mutations  that  allow  for 
binding  to  mammalian  receptors  have  been  suggested  as  the  key 
factor  for  potential  species  jumps  (6, 27).  Upon  serial  passaging  in 
ferrets  of  an  engineered  strain  containing  three  mutations  (two  in 
HA)  known  to  facilitate  receptor  binding,  viral  strains  capable  of 
respiratory  infection  containing  two  additional  mutations  were 
found.  Modeling  the  evolutionary  adaptation  of  such  specific  sys¬ 
tems  requires  an  empirical  fitness  landscape  (the  correspondence 
relationship  between  genotypes  and  replication  rate),  which  could 
be  inferred  from  a  sufficiently  large  set  of  multiple-sequence  align¬ 
ments  (MSAs).  For  this  fitness  inference  procedure,  we  adopted 
the  direct  coupling  analysis  (32,  33),  a  method  intended  for  ex¬ 
tracting  the  degree  of  coevolution  between  sites  within  a  protein 
due  mostly  to  direct  spatial  interactions  that  are  obscured  in  fre¬ 
quency  correlation  data.  It  allows  for  the  determination  of  a  max¬ 
imum-entropy  fitness  function  (a  “disordered  Potts  model”)  that 
can  be  used  to  estimate  fitness  values  of  arbitrary  sequences,  while 
taking  into  account  interresidue  correlations.  Ferguson  et  al.  have 
recently  used  a  similar  approach  in  their  study  of  HIV  fitness  land¬ 
scapes,  but  employing  a  reduced  description  of  a  two-letter  alpha¬ 
bet  (wild  type  [WT]  and  mutant)  and  Monte  Carlo  simulations 
for  inference  (34)  rather  than  the  mean  field  approximation  de¬ 
veloped  by  Morcos  et  al.  (33).  We  used  this  inferred  landscape  to 
simulate  the  adaptation  of  H5N1:  an  H5  protein  segment  evolves 
under  the  fitness  landscape  inferred  from  sequences  (H3)  natural 


to  mammalian  hosts,  e.g.,  the  H3N2  subtype  endemic  in  human 
populations.  Our  simulations  revealed  that  both  the  fitness  values 
of  genotypes  and  their  relative  accessibilities  dictated  by  the  ge¬ 
netic  code  strongly  affect  the  adaptation  dynamics. 

MATERIALS  AND  METHODS 

Stochastic  model  and  simulation.  The  basic  model  of  virus  evolution  can 
be  described  by  the  following  set  of  events: 


U  +  V„  ->  I„  (infection) 

a 

(i) 

+  vm  (replication) 

(2) 

ln~*  0  (death) 
b 

(3) 

Vn  -*  0  (clearance) 
b 

(4) 

where  U is  an  uninfected  host  cell  and  Vn  and  In  are  a  virion  with  genotype 
n  and  a  cell  infected  with  it,  respectively.  Equation  1  represents  the  infec¬ 
tion  of  a  host  cell  with  rate  a ,  and  equation  2  corresponds  to  a  replication 
event  in  which  an  infected  cell  of  genotype  n  sheds  a  virion  with  genotype 
m  with  replication  rate  rn  and  mutation  probability  (29,  30), 

Qmn  =  (1  -  |ui)i“‘i">»  (p./3)‘i”»  (5) 

where  p.  is  the  mutation  rate  (probability  for  substitution  per  nucleotide 
site  per  replication),  L  is  the  total  length  of  the  genomic  nucleotide  seg¬ 
ment,  and  dmn  is  the  Hamming  distance  (total  number  of  nucleotides  that 
differ)  between  genotypes  m  and  n.  Equation  5  gives  the  joint  probability 
of  mutating  dmn  sites  with  probability  p/3  (one  out  of  three  possibilities 
from  ntom),  while  leaving  L  —  dmn  sites  alone  with  probability  1  —  p.  In 
the  actual  simulations,  equation  5  is  a  consequence  of  the  algorithm  rather 
than  an  input,  because  the  target  genotype  m  is  determined  by  randomly 
mutating  each  site  of  genotype  n  with  probability  p/3  into  one  of  three 
possibilities.  The  genome  length  is  a  multiple  of  3  such  that  L  =  3 La,  where 
La  is  the  number  of  amino  acids.  A  given  genotype  n  is  translated  into  the 
corresponding  amino  acid  sequence  a(n)  using  the  standard  genetic 
code.  The  replication  rate  rn  is  a  function  of  amino  acid  sequence  only: 
rn  =  ra(ny  Equations  3  and  4  represent  the  death  of  an  infected  cell  and 
the  clearance  of  a  virion,  respectively,  which  are  assumed  to  have  the 
same  rate  b. 

The  dynamic  model  was  simulated  stochastically  by  the  Gillespie  al¬ 
gorithm  (31),  using  as  the  initial  condition  a  given  number  of  uninfected 
cells  ( L/0),  founder  viruses  (V0),  and  no  infected  cells.  The  founder  group 
consisted  of  virions  with  a  monoclonal  (WT)  genotype.  Simulations  pro¬ 
ceeded  by  dynamically  updating  the  list  of  genotypes  and  corresponding 
amino  acid  sequences  encountered  as  the  viral  population  became  more 
diverse  (28),  instead  of  enumerating  all  possible  sequences,  which  become 
exponentially  numerous  even  for  small  La. 

For  comparison,  we  also  considered  an  extended  model  for  adaptation 
in  animal  hosts  with  immune  response  for  which  equation  3  is  replaced  by 

In  -*  I„  +  Tn  (stimulation)  (6) 

S 

4  +  Tn  Tn  (killing)  (7) 

Tn  -*■  0  (clearance)  (8) 

b 

where  T„  denotes  a  cytotoxic  T  lymphocyte  specific  to  genotype  n,  stim¬ 
ulated  with  rate  s  by  the  presence  of  infected  cells  and  capable  of  killing 
them  with  rate  c. 

Serial-passage  protocol.  To  model  serial  passages,  we  implemented  a 
stochastic  simulation  algorithm  where  the  monoclonal  founder  group 
underwent  a  growth  process  for  a  fixed  amount  of  time  (t),  and  the  re¬ 
sulting  quasispecies  was  randomly  sampled  to  form  a  new  polyclonal 
founder  group  for  the  next  round  of  passage.  All  free  virions  were  sam¬ 
pled,  while  infected  cells  were  excluded.  A  given  virion  has  a  probability 
/  =  V0/V  <  1  to  be  sampled,  such  that  one  round  of  growth  producing 
population  size  V  is  followed  by  the  next  round  with  an  initial  number  fV 
of  viruses.  This  procedure  leaves  the  set  of  frequencies  of  genotypes  pres- 
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ent  before  and  after  the  bottleneck  event  roughly  the  same.  Strains  with 
small  frequencies,  however,  often  disappear  after  a  bottleneck  event  (“ex¬ 
tinctions”). 

Simulation  under  random  landscapes.  We  first  performed  two  dif¬ 
ferent  classes  of  simulations:  those  in  which  the  founder  viruses  are  al¬ 
ready  well  adapted  to  the  host  and  those  where  the  founder  group  faces  a 
novel  host  environment.  In  both  cases,  a  key  input  to  the  model  was  the 
fitness  function  ra,  for  which  we  first  adopted  random  fitness  landscapes: 
the  fitness  function  was  modeled  as  a  Gaussian  random  variable  with 
mean  (ra)  =  r0  exp {—dJQ  and  standard  deviation  cr,  where  da  is  the 
distance  of  sequence  a  (amino  acid  sites  that  are  different)  from  a  refer¬ 
ence  with  fitness  r0,  and  ^  is  a  characteristic  distance  of  the  fitness  decay. 
We  used  ij  =  1  amino  acid  (aa)  and  a  =  0. 1  unless  otherwise  specified.  The 
choice  of  reference  sequence  ( da  =  0)  distinguishes  the  two  scenarios 
without  and  with  adaptations:  in  the  former,  the  reference  sequence  co¬ 
incides  with  the  founder  WT  sequence,  while  in  the  latter,  they  are  dis¬ 
tinct.  The  reference  (i.e.,  the  “most  fit”  [MF]  sequence)  was  taken  as  a 
random  nucleotide  sequence  of  a  certain  length,  translated  into  the  amino 
acid  sequence,  and  assigned  the  fitness  r0.  The  results  shown  were  ob¬ 
tained  typically  by  averaging  >100  different  realizations,  each  simulated 
with  different  random  sequences  under  the  given  parameter  set  and 
length/distance  specifications.  Dependence  on  the  sample  size  of  realiza¬ 
tions  indicated  rapid  convergence  within  less  than  ~  100  realizations,  with 
consistently  large  variances  (data  not  shown). 

Simulations  of  passages  without  adaptation  were  started  with  these 
MF  sequences  as  the  founder  group.  For  simulations  of  passages  with 
adaptation,  WT  and  MF  sequences  in  each  realization  were  generated  by 
first  assigning  the  WT  randomly  and  then  subsequently  mutating  it  with  a 
p  of  0.1  until  a  sequence  with  the  desired  amino  acid  distance  was  ob¬ 
tained.  The  latter  sequence  was  designated  the  MF  sequence. 

The  adaptation  process  studied  in  this  work  was  defined  by  the  explo¬ 
ration  of  sequence  space  by  the  viral  population  during  passages  starting 
from  a  founder  population  with  a  suboptimal  fitness  (WT),  leading  to  the 
discovery  and  dominance  of  the  MF  sequence.  Simulations  were  per¬ 
formed  with  the  genomic  segment  length  La  =  d ,  where  d  is  the  distance 
between  WT  and  MF  sequences.  As  a  characteristic  measure  of  the  effi¬ 
ciency  of  adaptation,  we  probed  the  time  required  (“jumping  time”)  for 
the  MF  sequence  to  be  discovered  and  its  frequency  (calculated  by  sum¬ 
ming  the  numbers  of  free  virions  and  infected  cells)  to  reach  50%  of  all 
sequences.  The  inverse  of  this  time  was  defined  as  the  “jumping  rate”  (/), 
a  stochastic  quantity  with  a  wide  distribution  including  /  =  0,  which 
corresponds  to  cases  where  the  trajectory  never  reached  the  MF  sequence. 
The  maximum  simulation  time  was  set  such  that  it  was  much  longer  than 
the  typical  jumping  times  for  trajectories  that  did  reach  MF.  Realizations 
that  became  stuck  with  a  nonzero  distance  to  MF  exhibited  stationary  MF 
frequencies  of  <50%  (often  zero)  beyond  the  time  maximum  and  were 
counted  as  having  a  zero  jumping  rate. 

Inference  of  fitness  landscapes.  To  model  specific  evolutionary  adap¬ 
tations  more  directly  linked  to  experimental  studies,  we  derived  empirical 
fitness  landscapes  of  HA  sequences  for  influenza  H5N1  viruses  and  used 
them  as  input  to  our  stochastic  model.  For  the  fitness  inference,  we  ad¬ 
opted  a  modified  version  of  the  direct  coupling  analysis  algorithm  (32, 33) 
and  used  it  to  infer  the  fitness  landscape  of  a  model  influenza  HA  protein 
segment.  In  this  inference  procedure,  one  employs  an  empirical  func¬ 
tional  form  of  the  stationary  distribution  of  genotypes  that  maximizes  its 
entropy  and  infers  the  parameters  in  this  expression  using  MSAs.  We 
downloaded  influenza  virus  sequence  sets  from  the  NCBI  (35):  for  H5,  we 
used  2,085  full-length  HA  sequences  of  H5N1,  excluding  laboratory 
strains,  while  for  H3,  we  used  4,113  full-length  HA  sequences  of  H3 
strains. 

For  inference,  the  MSAs  were  used  to  derive  the  empirical  single-site 
and  two-site  amino  acid  distributions,  f(a)  and  /7(a,p5),  each  giving  the 
frequency  of  amino  acid  a  on  site  i  and  the  joint  frequency  of  a  and  |3  on 
sites  i  and  j,  respectively.  A  prior  distribution  p0(a)  with  a  count  n0  was 
used  in  frequency  calculations,  such  that  the  total  effective  number  of 


sequences  was  n0  +  Ns,  where  Ns  is  the  actual  sequence  count.  For  the 
prior  distribution  p0( a),  we  used  the  frequency  values  from  the  Jones- 
Taylor-Thornton  matrix  (36),  augmented  by  p0  =  0.01  for  the  gap,  and 
renormalized  the  2 1  values.  The  prior  count  n0  values  used  were  3,000  and 
4,000  for  H5  and  H3,  respectively.  The  set  of  empirical  distributions,  f(a) 
and/jj(c<,p),  were  then  matched  to  the  maximum-entropy  expression  of 
the  probability  of  a  sequence  q  =  (oq, .  .  .,  otLa),  p(q)  ~  exp (rq/rs),  where 

rq  =  2  +  2  aj)  (9) 

<  « <} 

was  obtained  using  the  mean-field  approximation  (33),  where  the  matrix 
inverse  of/7(a,B)  gives  the  parameters  C^(ol,P)  after  excluding  a  reference 
amino  acid,  which  we  took  as  Ala,  from  the  argument  list.  In  our  applica¬ 
tion,  the  frequency  data  for  the  La  =  4  minigenome  (103,  156,  222,  and 
224  [H5  numbering],  identified  as  the  locations  of  key  mutations  in  HA 
during  H5N1  adaptation  in  ferrets  [6])  were  used  to  infer  the  parameter 
set  C;/a,P)  and  h{( a).  The  inference  only  determined  fitness  values  up  to 
an  additive  constant  (in  addition  to  the  reference  fitness  value  rs),  which 
we  chose  such  that  genotypes  neighboring  the  MF  sequence  had  suitable 
replication  rates  within  the  stochastic  model. 

Simulation  of  influenza  virus  adaptation.  We  used  the  inferred  H3 
fitness  landscape  as  an  input  to  the  stochastic  simulations  of  H5  WT 
adaptation  in  mammalian  host  environments.  Each  simulation  was 
started  with  the  WT  genotype  CATACACAAGGA  (HTQG)  of  H5N 1  with 
a  different  random  number  seed,  and  each  mutant  generated  in  the  sim¬ 
ulations  was  dynamically  assigned  the  fitness  calculated  by  the  inferred 
fitness  landscape  parameters.  The  reference  fitness  value  was  set  as  rs  =  1 
day-1.  The  WT  and  MF  (SAQG)  fitness  values  were  4.9  and  21.7  (in  units 
of  rs),  respectively.  This  setup  mimics  the  adaptation  of  H5  under  the 
selective  environments  native  to  H3  strains. 

RESULTS 

Stochastic  model.  The  basic  model  of  stochastic  virus  evolution 
used  in  this  study  involved  the  following  set  of  species:  uninfected 
host  cell  U,  a  virion  with  genotype  n  (denoted  by  V„),  and  a  cell 
infected  with  the  virion  In.  They  interacted  according  to  the  fol¬ 
lowing  set  of  events:  an  infection  of  a  host  cell  with  rate  a,  a  repli¬ 
cation  event  where  In  produced  Vm  with  replication  rate  rn  and 
mutation  probability  Qmn  (a  function  of  the  mutation  rate  p,  de¬ 
fined  as  the  probability  of  making  a  substitution  error  per  nucle¬ 
otide  site  per  replication),  and  two  additional  events  in  which  an 
infected  cell  dies  and  a  virion  is  cleared  with  rate  b  (see  Materials 
and  Methods).  We  adopted  the  same  rate  for  the  last  two  processes 
for  simplicity.  The  clearance  of  viruses  may  be  viewed  as  reflecting 
the  effect  of  any  innate  and/or  acquired  immune  responses  pres¬ 
ent  if  passages  occur  in  live  animals  or  unspecified  decay  mecha¬ 
nisms  in  cell  cultures. 

If  the  mutation  rate  p  is  zero  ( Qmn  =  1  if  m  =  n  and  0  other¬ 
wise),  the  continuum  approximation  to  equations  1  to  4  becomes 
equivalent  to  the  target  cell-limited  model  used  by  Baccam  et  al. 
(20)  and  Pawelek  et  al.  (37)  to  study  the  kinetics  of  influenza  A 
virus  infection.  We  compared  the  full  discrete  stochastic  simula¬ 
tion  with  the  outcomes  of  the  continuum  differential  equation 
representation  and  found  that  the  latter  severely  overestimated 
the  speed  of  adaptation  (explored  in  detail  in  the  following  sub¬ 
sections)  because  it  assigned  nonzero  frequencies  to  all  genotypes 
via  equation  5  at  all  times  t  of  >0  (data  not  shown).  Only  for  a 
simplified  model  with  L  =  3  nt  (64  possible  genotypes)  and  an 
artificially  high  mutation  rate  p  of  ~0.1  was  the  difference  be¬ 
tween  the  two  methods  relatively  small.  As  p  decreased  (or  con¬ 
versely  with  increasing  L,  which  increases  the  total  number  of 
genotypes  exponentially  while  the  population  size  remains  small), 
the  population  dynamics  quickly  became  dominated  by  the  finite- 
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FIG  1  Viral  growth,  host  cell  infection,  and  within-host  diversification  dy¬ 
namics  predicted  by  the  stochastic  model  for  the  following  parameter  values: 
La  =  3  aa,  Vq  =  1  X  103,  a  =  1.0  X  10-3  day-1,  b  =  3.0  day-1,  r0  =  6.0  day-1, 
fx  =  1.0  X  10-5,  and  U0=  IX  106.  (A)  Viral  titer  and  uninfected  cell  number 
U0.  Symbols  represent  the  experimental  data  from  references  20  and  38,  plot¬ 
ted  assuming  a  scaling  factor  of  1  viral  count  per  1  TCID50/ml.  (B)  Total 
numbers  of  genotypes  (nucleotide  and  amino  acid  sequences)  as  a  function  of 
time. 


size  effects:  stochastic  simulations  of  the  full  discrete  model  pre¬ 
dicted  long  waiting  times  for  the  discovery  of  MF  genotype,  while 
the  continuum  approximation  yielded  a  rapid  rise  in  MF  fre¬ 
quency.  Therefore,  to  realistically  model  the  time-dependent  ad¬ 
aptation  behavior  of  virus  populations,  it  is  important  to  take  into 
account  the  discrete  nature  of  population  sizes. 

Viral  growth  curve.  The  typical  time-dependent  behavior  of 
virus  growth  and  uninfected  cell  depletion  for  a  single  cell  culture 
or  single  host  is  shown  in  Fig.  1  A,  along  with  the  experimental  data 
of  influenza  A  virus  FI  INI  infection  kinetics  from  references  20 
and  38.  The  total  virus  count  V  in  our  model  output  was  assumed 
to  be  proportional  to  the  experimentally  measured  viral  titers.  For 
comparison,  we  assumed  an  empirical  scaling  factor  such  that  1 
virus  count  corresponded  to  1  50%  tissue  culture  infective  dose 
(TCID50)/ml.  During  simulations,  the  initial  clonal  founder  virus 
group  (V0  virions)  underwent  a  short  period  of  clearance  before 
the  target  cells  (initially  U0)  became  infected  appreciably.  The  viral 
titer  began  to  rise  as  infection  and  replication  occurred,  whereas 
the  number  of  uninfected  cells  decreased.  The  growth  stopped 
when  the  target  cells  became  depleted  and  was  followed  by  gradual 
cell  death  and  virus  clearance  processes.  The  parameter  values 
were  chosen  to  obtain  a  good  agreement  in  virus  count  between 
the  simulation  and  the  experimental  data.  The  value  of  a  primarily 
affected  the  shape  of  the  initial  rise  in  virus  count,  with  larger 
values  leading  to  a  flatter  curve  up  to  ~2  days  (initial  virions 
rapidly  consumed  by  infection)  followed  by  a  sharper  increase 
(infected  cells  start  producing  copied  virions).  The  value  of  b  af¬ 
fected  the  decay  rate  after  saturation,  with  larger  values  leading  to 
a  faster  decrease  in  virus  count. 

This  overall  behavior  of  total  species  count  (uninfected  cells 
and  viruses  of  all  genotypes)  shown  in  Fig.  1  was  qualitatively 
similar  to  those  obtained  with  a  kinetic  model  by  Baccam  et  al. 


(20).  However,  the  discrete  stochastic  model  provided  previously 
unavailable  and  realistic  views  of  the  genomic  diversification  pro¬ 
cess  accompanying  this  growth  (Fig.  IB).  The  results  shown  in  Fig. 
1  were  obtained  by  averaging  over  many  explicit  representations 
of  La  =  3  aa  (L  =  9  nt)  genomic  segments  with  random  sequences 
and  decreasing  fitness  distributions  of  mutants  around  the  WT 
with  increasing  distance  (see  Materials  and  Methods),  a  choice 
that  has  been  empirically  validated  using  experimental  data  (28). 
The  numbers  of  distinct  nucleotide  and  amino  acid  sequences 
increased  with  the  viral  titer  growth,  reaching  a  peak  together  with 
the  virus  count.  The  cell  death/clearance  phase  (time,  3  to  8  days) 
led  to  a  restoration  of  the  monoclonal  population,  which  was  due 
to  the  relatively  low  mutation  rate  (x  of  1 X  10~5,  close  to  typical 
values  estimated  for  RNA  viruses  (39,  40).  The  maximum  num¬ 
bers  of  genotypes  shown  in  Fig.  IB,  ~25  nt  sequences  (and  ~17  aa 
sequences),  in  fact,  are  close  to  the  number  of  all  possible  single 
mutants  plus  the  WT,  1  +  3L  =  28  nt  sequences,  which  suggests 
that  with  the  given  mutation  rate  value,  single  mutants  dominated 
the  within-host  quasispecies  for  small  genomic  segments  of  La 
on  the  order  of  1  aa.  In  summary,  in  addition  to  illustrating  qual¬ 
itative  behavior  of  viral  growths  in  agreement  with  experimental 
trends,  our  model  also  revealed  that  genomic  diversification  is 
maximal  when  the  population  size  grows  rapidly. 

Serial  passages.  We  extended  the  basic  stochastic  model  given 
by  equations  1  to  4  to  simulate  the  typical  growth  and  evolution¬ 
ary  dynamics  encountered  in  experimental  serial  passages.  In  this 
protocol,  simulations  of  single-host  growth  as  shown  in  Fig.  1 
were  terminated  at  a  certain  time  (passage  interval  t,  chosen  near 
the  viral  count  maximum),  and  the  quasispecies  population  was 
sampled  with  equal  probability  for  all  virions  present  to  produce  a 
founder  group  of  virions  to  infect  fresh  new  target  cells  (with  the 
same  U0)  for  the  next  round  of  the  passage.  The  “bottleneck  fac¬ 
tor”  /<  1,  defined  as  the  ratio  of  the  initial  viral  population  size  of 
a  round  to  the  final  viral  population  size  of  the  previous  round 
(and  controlled  by  varying  the  founder  group  size  V0),  quantifies 
the  strength  of  selection  pressure  exerted  on  viruses:  smaller  / 
values  cause  only  the  most  dominant  genotypes  (higher  frequen¬ 
cies)  to  survive  these  bottleneck  events.  The  survival  probability 
therefore  depends  on  how  efficiently  a  given  strain  infects  the 
limited  number  of  target  cells  to  replicate  in  competition  with 
others. 

We  first  examined  the  simpler  case  in  which  the  viral  popula¬ 
tion  was  already  well  adapted  to  the  target  cell  environment,  and 
no  systematic  evolutionary  drift  in  sequence  space  was  expected. 
This  situation  was  modeled  by  using  a  fitness  function  centered 
around  the  WT  (the  genotype  corresponding  to  the  founder 
group  of  the  first  passage).  Figure  2  shows  the  typical  time-depen¬ 
dent  behavior  of  the  virus  population  size  and  genomic  diversity. 
The  total  viral  count  repeated  the  growth  pattern  after  a  bottle¬ 
neck  event  with  the  passage  period  t  of  3  days,  with  the  initial  value 
at  the  beginning  of  a  round  kept  constant  as  V0.  The  smaller  value 
of  death/clearance  rate  b  in  simulations  shown  in  Fig.  2  compared 
to  Fig.  1  resulted  in  flatter  plateau  regions  near  the  viral  count 
peaks,  where  the  genomic  diversity  was  maintained  near  the  sin¬ 
gle-mutant  maximum  level  of  ~28  nt  sequences.  Although  the 
bottleneck  selection  cuts  this  diversity  down  to  ~1  (WT)  in  this 
particular  case,  we  found  the  initial  number  of  strains  at  each 
round  to  vary  depending  on  f  approaching  the  maximum  for 
larger /values. 

The  combined  frequency  of  mutants  shown  at  the  bottom  part 
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FIG  2  Time  dependence  of  viral  population  size  and  diversity  over  serial 
passage  simulations  without  adaptation.  The  WT  and  MF  coincide  in  the  se¬ 
quence  space.  The  parameter  values  were  the  same  as  in  Fig.  1  except  that  b  = 
1.0  day-1  and  t  =  3  days. 


of  Fig.  2  indicates  that  during  the  initial  growth  process  in  a  given 
passage  round,  the  rapid  population  growth  with  mutations  led  to 
a  sharp  increase  in  diversity.  Then,  as  the  target  cells  were  de¬ 
pleted,  due  to  an  increase  in  viral  population,  the  mutant  fre¬ 
quency  decreased  again  after  ~1.5  days.  Because  in  this  case  the 
fitness  distribution  was  centered  at  the  WT,  the  viral  population 
reverted  to  the  WT-dominant  structures  after  each  moderate  di¬ 
versification  stage  under  selection.  We  examined  the  trend  in  the 
total  number  of  genotypes  present  after  the  end  of  each  passage 
round  as  a  function  of  passage  number  up  to  30  passages  and 
confirmed  that  under  this  condition,  each  passage  round  was  sta¬ 
tistically  independent  of  the  previous  round  (data  not  shown).  In 
summary,  our  model  allows  for  a  simple  description  of  repeated 
bottleneck  events  under  serial  passages.  If  the  initial  population  is 
already  well  adapted  to  the  host  environment,  the  selection  events 
did  not  affect  the  quasispecies  structure  appreciably. 

Adaptation.  More  typical  in  experimental  serial  passages  is  the 
situation  in  which  the  WT  is  exposed  to  a  new  host  environment 
in  which  its  fitness  is  not  optimal.  Evolutionary  dynamics  then 
exhibit  an  adaptation  process  where  systematic  drifts  toward  more 
fit  genotypes  occur  within  a  characteristic  timescale,  which  we 
refer  to  as  the  (species)  “jumping  time.”  In  our  modeling,  the 
adaptation  process  was  defined  as  the  exploration  of  sequences 
near  the  initial  WT  during  passages  and  the  discovery  of  more  fit 
genotypes  leading  to  the  increase  in  their  frequencies.  We  adopted 
fitness  distributions  where  a  WT  was  chosen  at  a  certain  distance  d 
from  the  MF  genotype.  We  found  this  distance  to  MF  to  be  the 
dominant  factor  influencing  the  relative  probability  of  adaptation. 
The  jumping  time  was  defined  as  the  time  required  since  the  ini¬ 
tiation  of  passages  for  the  MF  genotype  to  be  discovered  and  for  its 
frequency  to  reach  50%  of  the  viral  population. 

Figure  3A  shows  typical  evolutionary  dynamics  of  such  an  ad¬ 
aptation  process  from  our  simulations  averaged  over  100  realiza¬ 
tions,  where  d  =  2  aa  and  other  parameter  values  were  chosen  to 
best  illustrate  the  typical  behavior  in  simulations  under  these  set¬ 
tings.  During  the  earlier  rounds  where  the  population  was  domi¬ 
nated  by  the  WT  with  a  lower  fitness,  the  total  virus  count  reached 
was  relatively  smaller.  It  gradually  increased  with  the  increasing 


FIG  3  Time  dependence  of  population  size  and  diversity  (A)  as  well  as  mean 
distance  to  the  MF  sequence  and  the  frequency  of  the  MF  (B) .  Parameter  values 
were  as  follows:  La  =  d  =  2  aa,  VQ  =  1  X  104,  U0  =  1  X  106,  a  =  1.0  X  1 0  3 
day-1,  b  =  1.0  day-1,  r0  =  20  day-1,  p,  =  1.0  X  10-5,  and  t  =  3  days. 


number  of  passages  as  mutants  with  higher  fitness  values  appeared 
and  displaced  the  WT.  Figure  3B  shows  variations  of  the  mean 
distance  to  the  MF  and  the  frequency  of  the  MF.  The  mean  dis¬ 
tance  roughly  decreased  monotonically  with  the  number  of  pas¬ 
sages,  whereas  the  MF  frequency  did  not  increase  appreciably 
from  zero  until  the  third  passage  round:  the  adaptation  at  the  early 
stage  primarily  arose  from  mutants  other  than  the  MF,  which  only 
appeared  after  a  considerable  period  of  “search”  in  the  sequence 
space.  Both  the  approach  to  the  MF  and  the  increase  in  MF  fre¬ 
quency  occurred  in  graduated  “steps”  with  most  changes  concen¬ 
trated  in  the  early  phase  of  each  passage  round,  where  active 
growth  occurred  and  competition  among  strains  was  most  severe. 

The  late-stage  limiting  values  of  the  mean  distance  and  fre¬ 
quency  after  many  rounds  of  passages  were  fairly  distant  from  zero 
and  one,  respectively.  We  found  that  this  feature  was  primarily 
due  to  the  sizeable  presence  of  realizations  in  which  evolutionary 
dynamics  got  “stuck”  in  mutants  distinct  from  the  MF  but  with 
fitness  values  larger  than  that  of  WT,  rather  than  a  consistent 
convergence  to  these  average  limiting  values.  This  feature  can  be 
seen  more  clearly  in  Fig.  4,  which  shows  the  distribution  of  the 
jumping  rate  /,  defined  as  the  inverse  of  the  jumping  time.  The 
distribution  is  multimodal,  with  broad  peaks  at  larger,  finite  rates 
and  a  nonzero  probability  of  zero  rate.  The  latter,  more  pro¬ 
nounced  for  d  =  2  aa  (Fig.  4B),  can  be  attributed  to  the  topological 
structure  of  the  genetic  code:  transitions  via  substitution  errors 
between  a  significant  number  of  pairs  of  amino  acids  require  mul¬ 
tiple  mutations,  making  them  highly  unlikely.  A  population  can 
become  dominated  by  a  genotype  close  to  the  MF  in  distance  but 
not  directly  accessible.  The  fine  structure  of  the  distributions 
shown  in  Fig.  4  reflects  the  characteristics  of  passage  dynamics:  the 
rightmost  cluster  of  peaks  in  Fig.  4A  {d  =  1  aa)  corresponds  to 
cases  in  which  the  MF  frequency  exceeded  50%  during  the  first 
passage  cycle,  the  cluster  to  its  left  the  second  cycle,  and  so  on.  For 
a  d  of  2  aa,  the  probability  of  jumping  during  the  first  passage  cycle 
was  virtually  zero  (Fig.  4B).  The  gaps  between  these  clusters  ap- 
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FIG  4  Jumping  rate  (/)  distributions  from  the  sets  of  I03  realizations  using 
parameter  values  as  in  Fig.  3  for  two  different  distances:  d  =  1  aa  (A)  and  d  = 
2  aa  (B).  The  dashed  lines  show  maximum-likelihood  fits  to  linear  combina¬ 
tions  of  normal  distributions.  The  cluster  of  peaks  in  panel  A  centered  at  a  /  of 
=0.3  day-1,  0.16  day-1,  0.11  day-1,  and  0.07  day-1  correspond  to  jumping 
events  (MF  frequency  exceeding  50%)  during  the  first,  second,  third,  and 
fourth  passage  rounds,  respectively.  The  peaks  at  a  /  of  0  arise  from  trajectories 
trapped  at  sequences  without  direct  single  substitution  route  to  the  MF  se¬ 
quence. 

peared  because  large-scale  changes  to  frequency  distribution  oc¬ 
cur  mostly  during  the  early  growth  stage  within  each  round  (Fig. 
3B).  The  jumping  rate  correlates  with  the  general  likelihood  of 
observing  a  given  adaptation  during  passages.  To  support  this 
conclusion  quantitatively,  we  fitted  the  distributions  in  Fig.  4  us¬ 
ing  maximum-likelihood  estimation  to  the  functional  form  of  a 
sum  of  normal  distributions  with  the  amplitude,  mean,  and  stan¬ 
dard  deviations  as  parameters,  shown  in  Fig.  4  as  dashed  lines.  The 
jumping  time  in  units  of  passage  numbers  for  the  modes  with 
significant  amplitudes  (the  peaks  in  Fig.  4  from  right  to  left)  were 
1.2,  2.1,  and  3.1  for  Fig.  4A  and  2.1,  3.1,  4.1,  and  5.1  for  Fig.  4B, 
confirming  that  jumping  events  mostly  occur  near  the  beginning 
of  each  passage  cycle.  In  summary,  the  model  allowed  for  quanti¬ 
tative  characterizations  of  the  speed  of  adaptation  in  terms  of  the 
time  required  for  a  highly  fit  genotype  to  appear  and  dominate  the 
quasispecies  population  under  passages. 

Variations  of  jumping  rate.  We  examined  the  dependence  of 
jumping  rate  /,  which  is  a  measure  of  the  relative  probability  of 
observing  a  certain  adaptation  from  a  WT  to  a  MF  genotype  dur¬ 
ing  passages,  on  mutation  rate  p,  (Fig.  5).  For  a  given  distance  d 
between  the  WT  and  MF,  the  jumping  rate  monotonically  in¬ 
creased  with  increasing  p.  The  large  variance  in  jumping  rate 
showed  little  sample-size  dependence  and  reflected  the  contribu¬ 
tion  of  pathways  inaccessible  to  the  MF  genotype  in  sequence 
space,  as  well  as  the  multimodal  structure  of  its  distribution  over 
the  initial  phase  of  many  passage  cycles  (Fig.  4).  The  relative  am¬ 
plitudes  of  normal  distributions  fitted  to  the  data  in  Fig.  4A  and  B 
were  distributed  evenly,  with  proportions  of  0.5,  0.3,  and  0.1  and 
0.2,  0.2,  and  0.2,  respectively,  for  the  first  3  main  peaks  from  the 
right.  To  examine  the  effects  of  heterogeneity  in  substitution  rates 
at  different  sites,  which  would  become  more  important  for  longer 
sequences,  we  also  considered  the  case  where  the  mutation  rate  p 
at  each  nucleotide  site  was  sampled  from  a  gamma  distribution 
(41).  For  an  I  of  6  nt,  the  site  heterogeneity  did  not  affect  the 


FIG  5  Dependence  of  jumping  rate  on  the  mutation  rate  for  the  WT-to-MF 
distance  from  1  to  3  aa.  Symbols  represent  averages  and  error  bars  represent  1 
standard  deviation  over  100  realizations.  Parameter  values  were  as  follows: 
V0  =  1  X  105,  U0  =  1  X  106,  a  =  1.0  X  10-3  day-1,  b  =  0.1  day-1,  r0  =  10 
day-1,  and  t  =  3  days. 

jumping  rate  appreciably  (data  not  shown).  Its  effect,  however, 
may  become  stronger  for  longer  genomic  segments,  for  which 
deep-sequencing  data  of  viral  genomes  (42)  could  be  used  to  as¬ 
sign  site  and  nucleotide-dependent  substitution  rates. 

While  jumping  rate  values  became  statistically  similar  for  dif¬ 
ferent  distances  near  p  of  ~  1 0 - 3  (Fig.  5 ) ,  they  differed  widely  near 
the  experimental  range  of  mutation  rates  of  ~10-5,  where  /  de¬ 
creased  exponentially  with  increasing  d,  reaching  /  of  ~10-2 
day- 1  for  a  d  of  3  aa,  or  a  jumping  time  of  ~  100  days  (33  passages). 
From  the  results  shown  in  Fig.  5,  we  may  infer  estimates  for  the 
typical  number  of  passages  (of  3  days  for  each  round)  that  would 
be  required  under  a  typical  mutation  rate  of  ~10-5  to  arrive  at 
and  jump  to  an  MF  sequence:  3,  6,  and  33  passages  for  1-,  2-,  and 
3-aa  distances,  respectively. 

The  dependences  of  the  jumping  rate  with  other  physical  pa¬ 
rameters  are  shown  in  Fig.  6.  The  host  cell  number  U0  determines 
the  viral  population  size  reached  within  each  passage  round.  The 
jumping  rate  monotonically  increased  with  increasing  l/0,  satu¬ 
rating  near  ~0.1  day-1  (Fig.  6A).  The  bottleneck  size  V0  for  a 
given  U0  represents  the  strength  of  the  selection  pressure,  deter¬ 
mining  the  bottleneck  factor/.  Although  at  first  the  jumping  rate 
increased  with  increasing  V0  for  small  V0,  it  reached  a  maximum 
and  began  to  decrease  as  V0  approached  U0  (Fig.  6B).  For  small  V0, 
more  fit  mutants  found  during  a  passage  round  frequently  were 
lost  because  of  their  low  frequencies,  whereas  for  large  enough  V0, 
they  usually  survived  the  bottleneck  selections.  The  increase  of  / 
with  increasing  V0  for  small  V0  therefore  is  a  gradual  attenuation 
of  the  finite-size  effect.  For  a  V0  of  >104  (or/of  >10-2  for  V  of 
106),  the  starting  viral  population  at  each  round  was  sufficiently 
large,  making  a  smaller  bottleneck  size  more  favorable  for  adap¬ 
tation. 

The  passage  period  t  affects  the  jumping  rate  primarily  via  the 
size  and  frequency  of  viral  populations  at  which  bottleneck  events 
occur.  The  approximate  increase  of  /  with  decreasing  t  (Fig.  6C) 
can  be  explained  by  the  increase  in  the  fraction  of  time  a  popula¬ 
tion  experienced  strong  competition  via  rapid  growth  during  each 
passage  round  (Fig.  3).  If  the  period  is  too  short  (t  =  1  day),  this 
growth  phase  begins  to  be  interrupted  by  bottlenecks,  decreasing  / 
again.  We  found  the  jumping  rate  J  to  be  insensitive  to  variations 
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FIG  6  Dependence  of  jumping  rate  on  other  parameters.  (A)  Number  ofhost 
cells  U0.  (B)  Bottleneck  size  V0,  or  the  number  of  virions  selected  at  each 
passage  round.  (C)  Passage  period  t,  or  the  time  duration  of  each  passage 
round.  (D)  Infection  rate  a.  (E)  Rate  b  of  infected  cell  death  and  virus  clear¬ 
ance.  (F)  Fitness  landscape  peak  width  parameter  £  of  the  MF  sequence.  De¬ 
fault  values  of  parameters  other  than  those  varied  in  each  case  were  the  same  as 
in  Fig.  5,  with  d  =  2  aa,  p,  =  1.0  X  Id-5,  and  £,  =  1  aa. 


in  infection  rate  a  (Fig.  6D).  The  dependence  on  death/clearance 
rate  b  was  similar  (Fig.  6E),  except  when  b  increased  beyond  1 
day-1,  where  cell  death  and  clearance  began  to  annihilate  the 
growth  phase.  Finally,  we  varied  the  characteristic  width  parame¬ 
ter  £;  of  the  fitness  peak  (Fig.  6F),  where  increasing  £  corresponded 
to  the  fitness  peak  at  the  MF  sequence  becoming  broader,  which 
sharply  reduced  the  jumping  rate:  the  probability  for  adaptation 
was  reduced  when  the  associated  fitness  gain  decreased.  For  a  ^  of 
<  1  aa,  fitness  values  of  the  WT  and  its  neighbors  are  too  small  to 
support  adaptation.  Overall,  the  global  picture  shown  in  Fig.  6  of 
the  dependence  of  the  jumping  rate  /  on  different  variables  other 
than  distance  suggests  that  the  efficiency  of  adaptation  is  much 
more  sensitive  to  the  characteristics  of  bottleneck  events  than  host 
cell  dynamics. 

Effects  of  immune  response.  For  serial  passages  in  animal 
hosts  (43,  44),  the  action  of  innate  and  adaptive  host  responses 
may  affect  the  dynamics  of  adaptation.  To  examine  such  effects, 
we  considered  an  extended  model  in  which  host  cells  infected  with 
a  genotype  stimulate  an  immune  cell  (cytotoxic  T  lymphocyte) 
specific  to  the  genotype,  which  can  kill  infected  cells  with  the  same 
genotype  (equation  3  replaced  by  equations  6  to  8).  A  similar 
implementation  is  possible  with  the  inclusion  of  antibodies  inter¬ 
acting  with  viruses  rather  than  infected  cells.  We  found  the  adap¬ 
tation  dynamics  of  the  extended  model  (Fig.  7)  to  be  qualitatively 
similar  to  Fig.  3,  which  suggests  that  the  basic  model  defined  by 


FIG  7  Adaptation  dynamics  of  the  extended  model  given  by  equations  1,  2, 
and  4  and  6  to  8  explicitly  including  host  immune  response.  Parameter  values 
were  the  same  as  in  Fig.  3  except  for  b  =  c  =  0.1  day-1  and  s  =  0.01  day-1. 

equations  1  to  4  may  be  regarded  as  an  effective  model  indirectly 
reflecting  the  selection  pressure  of  immune  response. 

Influenza  A  virus  H5N 1  adaptation.  To  gain  insight  into  how 
the  generic  behavior  of  adaptation  statistics  examined  as  described 
above  for  random  sequences  manifest  themselves  in  specific  virus 
evolution,  we  have  modeled  the  avian  influenza  virus  H5N1  evo¬ 
lutionary  adaptation  investigated  in  recent  experimental  studies 
(6,  27).  We  obtained  fitness  landscapes  specific  to  influenza  virus 
HA  sequences  by  adopting  the  direct  coupling  analysis  (32,  33)  for 
fitness  inference.  The  inferred  fitness  parameters  allow  one  to 
score  fitness  values  of  arbitrary  genotypes  that  might  be  encoun¬ 
tered  during  evolutionary  adaptation.  We  focused  on  the  seg¬ 
ments  of  H5  HA  protein  identified  as  allowing  the  H5N1  viruses  to 
acquire  the  capacity  to  infect  via  respiratory  routes:  His  103, 
Thrl56,  Gln222,  and  Gly224  (H5  numbering;  we  focused  on  the 
HA  protein  only,  excluding  Glu627  of  the  PB2  protein  reported  in 
the  experiment)  (6).  Upon  serial  passaging  in  ferrets  with  prein¬ 
troduced  mutations  Q222L  and  G224S,  which  had  previously 
been  shown  to  enhance  the  binding  affinity  of  H5  HA  to  human 
influenza  A  virus  receptors  (45),  two  additional  mutations  were 
found  among  strains  exhibiting  the  capability  of  spreading  via 
airborne  routes:  H103Y  and  T156A. 

We  took  the  four  key  residues  at  positions  103,  156,  222,  and 
224  to  form  the  model  protein  segment  of  La  =  4  aa,  for  which  the 
H5  WT  sequence  is  HTQG.  We  used  an  HA  protein  MSA  of  H5N 1 
sequences  to  infer  the  fitness  landscape  of  the  model  segment. 
Figure  8A  shows  the  first  20  genotypes  ranked  by  their  fitness 
values  within  all  possible  sequences.  The  sequence  HAQG  has  a 
fitness  comparable  (slightly  larger)  to  that  of  the  WT,  reflecting 
the  prevalence  of  Ala  156  within  the  sequence  set.  Most  of  the 
high-fitness  mutants  involve  a  single  substitution  at  position  156. 
This  landscape  corresponds  to  the  natural  environment  to  which 
H5  proteins  are  already  well  adapted:  avian  hosts.  To  obtain  the 
analogous  fitness  landscape  of  HA  in  mammalian  hosts,  we  used 
an  MSA  of  H3  sequences.  Figure  8B  shows  the  highest-fitness  ge¬ 
notypes  of  the  corresponding  positions  (110, 160,  226,  and  228  in 
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FIG  8  Fitness  landscapes  of  an  HA  protein  4-aa  segment  inferred  from  MSA.  (A)  H5  landscape  derived  from  H5N1  sequences.  (B)  H3  landscape  from  H3 
sequences.  The  height  of  the  bars  represents  the  fitness  values  in  units  of  rs=  1.0  day-1.  The  abscissa  shows  the  list  of  genotypes  sorted  with  decreasing  order  in 
fitness.  In  panel  B,  the  rightmost  bar  shows  the  H5  WT,  which  is  251st  in  rank. 


H3  numbering),  in  addition  to  the  fitness  of  H5  WT,  HTQG, 
which  is  less  than  25%  of  the  fitness  value  of  the  MF  sequence, 
SAQG.  The  last  two  positions  on  the  segment  are  still  dominated 
by  QG  as  in  H5:  the  enhanced  fitness  of  H5  Q222L/G224S  mutants 
in  H3  environments,  which  a  recent  structural  study  linked  to  the 
cis-trans  change  of  receptor  analogs  (46),  is  not  reflected  strongly 
in  the  natural  H3  populations  when  coupled  mutations  are  taken 
into  account.  We  therefore  modeled  the  adaptation  of  the  first  two 
HA  positions  by  taking  the  sequence  HTQG  as  the  WT  and  SAQG 
as  the  MF  sequence  and  simulating  the  adaptation  of  the  WT 
under  the  H3  fitness  landscape  in  Fig.  8B.  In  these  serial-passage 
simulations,  the  WT  nucleotide  sequence  at  time  zero  was  fixed 
(CATACACAAGGA)  and  the  fitness  values  of  the  appearing  mu¬ 
tants  were  obtained  from  the  computationally  inferred  landscape 
parameters  instead  of  being  randomly  assigned  from  a  distribu¬ 
tion. 

Typical  viral  count  evolutionary  patterns  were  similar  to  Fig.  3, 
while  the  outcomes  of  the  evolutionary  trajectories  in  the  se¬ 
quence  space  varied  greatly  between  different  realizations,  sug¬ 
gesting  that  the  “ruggedness”  of  the  empirically  derived  fitness 
landscape  with  multiple  peaks  and  valleys  significantly  affected  the 
adaptation  process.  The  majority  of  the  trajectories  did  not  reach 
the  MF  sequence  within  ~  103  days  (~300  cycles),  getting  trapped 
in  some  of  the  genotypes  listed  in  Fig.  8B.  One  trajectory  that 
reached  the  MF  sequence  is  illustrated  in  Fig.  9:  the  simulation 
starts  with  a  monoclonal  WT  (HTQG)  population  at  time  zero. 
The  T156A  mutation  already  appears  at  3  days,  reflecting  the  large 
jump  in  fitness  (Fig.  8B)  as  well  as  the  simplicity  of  the  nucleotide 
substitution  required  (A  — »  G).  The  Alal56  is  already  dominant 
after  9  days.  After  a  fairly  long  period  of  explorations,  a  new  mu¬ 
tant  (RAQG)  grows  in  frequency  ( t  =  99  days),  eventually  taking 
over  at  around  150  days.  It  is  closely  followed  by  LAQG,  which 
replaces  RAQG  at  162  days,  and  finally  by  SAQG,  the  MF  se¬ 
quence,  at  174  days.  Afterwards,  the  MF  frequency  grows  to 
around  0.99  and  remains  dominant,  with  the  subpopulation  of 
small  minority  mutants  continually  drifting  around.  The  time- 
dependent  progression  of  the  most  frequent  genotypes  (master 
sequences;  HTQG  —>  HAQG  — >  RAQG  — >  LAQG  — >  SAQG)  cov¬ 
ers  the  fitness  landscape  (Fig.  8B)  with  a  sequential  increase  in 
rank  (251  ->  18  ->  14  ->  5  ->  1)  and  fitness  (4.9  ->  12.3  -»  12.7  -> 
13.4  — >  21.7),  while  reflecting  the  accessibility  of  the  required  mu¬ 


tations  (Fig.  10).  We  calculated  the  mean  frequencies  of  dominant 
genotypes  (HAQG,  QAQG,  LAQG,  and  SAQG)  as  functions  of 
time  by  averaging  over  many  trajectories  (Fig.  11).  The  mean  fre¬ 
quency  of  the  MF  sequence  SAQG  of  ~  10%  with  the  timescale  of 
~300  days  reflects  the  probability  of  jumping  rather  than  the  ac¬ 
tual  composition  in  individual  trajectories:  if  it  occurs,  the  geno¬ 
type  eventually  dominates  the  population  (Fig.  9).  In  summary, 
our  results  show  that  the  basic  stochastic  model  of  adaptation  can 
be  combined  with  sequence-based  empirical  data  to  yield  detailed 
and  experimentally  relevant  predictions  of  virus  adaptation  dy¬ 
namics. 

DISCUSSION 

Our  stochastic  model  of  viral  evolutionary  dynamics  during  serial 
passages  allows  for  quantitative  characterizations  of  the  probabil¬ 
ity  of  observing  systematic  drifts  in  population  structure  and  ad¬ 
aptations  to  a  new  host  environment.  The  discrete  stochastic  rep¬ 
resentation  of  the  viral  evolutionary  dynamics  is  essential  to 
realistically  capture  adaptations  in  finite-size  populations  domi¬ 
nated  by  extinctions  of  minor  variants  during  bottleneck  events, 
which  cannot  be  described  using  viral  copy  numbers  as  continu¬ 
ous  variables. 

We  may  view  the  serial  passage  as  an  idealized  experimental 
setup  with  well-controlled  variables  deemed  important  for  evolu¬ 
tionary  dynamics.  The  ability  to  model  passages  quantitatively, 
therefore,  would  provide  insights  into  analogous  events  occurring 
in  natural  settings,  including  species  jumps  and  outbreaks  (5).  By 
calibrating  the  model  parameters  using  certain  serial  passage  ex¬ 
perimental  data,  one  could  apply  the  model  to  interpretations  of 
experiments  as  well  as  generation  of  predictions  and  testable  hy¬ 
potheses.  The  host  cell  number  U0  would  depend  on  the  plating 
density  for  cell  cultures,  typically  U0  of  ~106  or  less,  whereas  for 
passages  using  live  host  animals  it  would  be  larger  and  less  clear- 
cut.  The  bottleneck  size  Vq,  representing  the  number  of  virions 
sampled  after  each  passage  round  for  initiating  the  next  round, 
could  be  readily  estimated  based  on  the  sampling  protocol  of  the 
experiment.  Appropriate  values  for  the  infection  and  death  rates, 
a  and  b,  could  be  determined  by  fitting  growth  curves  as  shown  in 
Fig.  1,  whereas  the  mutation  rate  p  is  known  empirically  for  many 
viral  pathogens  (39,  40,  47). 

In  addition  to  these  physical  parameters,  the  fitness  landscape 
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FIG  9  Time  evolution  of  quasispecies  composition  during  the  H5-to-H3  adaptation  of  the  model  influenza  virus  segment,  simulated  under  the  empirically 
inferred  fitness  landscape  of  Fig.  8B.  Panels  A  to  I  show  the  sequential  snapshots  at  each  time  instance  (immediately  after  a  passage  cycle),  listing  five  genotypes 
with  the  highest  frequencies  (shown  in  logarithmic  scale).  The  parameter  values  were  as  follows:  V0  =  1  X  104,  Ua  =  I  X  10 6,a  =  1.0  X  1 0  day-1,  b  =  1.0  day-1, 
g  =  1  X  10-5,  and  t  =  3  days. 
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and  sequence  parameters  are  the  key  inputs  to  model  specifica¬ 
tion.  In  this  work,  we  first  adopted  a  random  sequence  model 
where  the  genomic  segment  under  consideration  ( La  =  1  to  3  aa; 
L  =  3  to  9  nt)  can  exhibit  fitness  values  assumed  to  be  random 
variables  around  a  mean  fitness,  which  decays  exponentially  with 
distance  from  an  MF  sequence  (see  Materials  and  Methods).  This 
choice  reflects  both  the  typical  distribution  of  mutant  fitness  over 
deleterious,  neutral,  and  beneficial  ranges  and  the  presence  of  one 
or  multiple  peaks  in  the  fitness  landscape.  This  assumption  has 
also  been  previously  validated  using  published  experimental  char¬ 
acterizations  of  HIV-1  protease  fitness  (28,  48). 

The  conclusions  drawn  from  this  part  of  our  study  pertain  to 
general  features  one  expects  to  see  in  observations  of  adaptation 
processes  common  to  many  different  viral  pathogens:  if  we  per¬ 
form  passages  in  host  environments  to  which  the  viruses  have 
already  been  well  adapted,  no  significant  new  drift  in  population 
structure  occurs  (Fig.  2).  In  reality,  however,  perfect  adaptation 


would  be  rare  and  one  almost  always  expects  drifts,  e.g.,  toward  a 
more  fit  genotype  2  aa  away  from  the  starting  WT  (Fig.  3).  The 
speed  of  this  adaptation  process  is  characterized  by  the  time  (or 
the  number  of  passage  rounds)  required  for  the  population  to  first 
discover  the  MF  sequence  and  then  become  dominated  by  it.  The 
jumping  rate  is  the  mean  inverse  time  of  this  discovery/growth 
process. 

The  multimodal  distribution  of  jumping  rates  shown  in  Fig.  4 
reveals  an  important  signature  of  the  topology  of  fitness  land¬ 
scapes  (49)  constrained  by  the  genetic  code:  if  we  visualize  the 
network  of  amino  acids,  where  edges  denote  single  substitutions 
(see  Fig.  S5  in  reference  28  and  Fig.  10),  an  evolutionary  path 
would  frequently  get  stuck  in  genotypes  with  amino  acid  se¬ 
quences  with  significantly  higher  fitness  than  the  WT  but  without 
a  direct  access  to  the  MF.  We  thus  expect  that  there  is  always  a 
nonzero  probability  for  a  certain  evolutionary  adaptation  path¬ 
way  to  never  reach  a  high-fitness  “target”  genotype  even  when  it  is 
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FIG  10  Influenza  virus  genotype  network  corresponding  to  the  fitness  land¬ 
scape  in  Fig.  8B.  Nodes  represent  the  genotypes  (first  two  amino  acids)  and 
edges  connect  pairs  for  which  there  exists  at  least  one  single-nucleotide  muta¬ 
tion  separating  the  two  nodes.  The  arrowheads  on  the  edges  reflect  the  direc¬ 
tion  of  fitness  increases  (Fig.  8B).  The  shaded  nodes  are  the  dominant  geno¬ 
types  shown  in  Fig.  1 1 .  The  length  of  a  given  path  does  not  necessarily  equal  the 
number  of  nucleotide  mutations:  QA  — >  LA  — >  SA  requires  multiple  mutations 
within  the  Leu-coding  codons. 


close  by  in  sequence  space.  This  feature  can  be  interpreted  as  a 
consequence  of  the  “ruggedness”  of  the  fitness  landscape,  leading 
to  frequent  trapping  of  evolutionary  dynamics  of  adapting  organ¬ 
isms  (50). 

The  jumping  rate  /  depends  sensitively  on  the  distance  from 
WT  to  MF  sequences  (Fig.  5);  for  (x  of  — 10  5,  La  values  from  1  to 
3  aa  yield /  values  decreasing  from  10_1day_1to  10-2  day-1,  with 
the  latter  corresponding  to  jumping  times  of  ~  100  days,  which  is 
beyond  the  upper  limit  of  typical  passage  durations.  Recent  exper¬ 
imental  demonstrations  of  species  jump-capable  mutations  in¬ 
duced  by  serial  passages  are  consistent  with  this  observation: 
Sheahan  et  al.  used  serial  passages  in  cell  culture  (12)  to  probe  the 
adaptations  presumed  to  have  been  responsible  for  the  SARS-CoV 
outbreak  in  2004,  with  the  species  jump  from  bats  to  humans  via 
palm  civets  as  host  species.  They  engineered  a  SARS-CoV  epi¬ 
demic  strain  in  which  the  S  protein  receptor-binding  domain  was 
replaced  by  its  counterpart  in  the  civet  strain  (the  WT)  and  pas¬ 
saged  it  in  human  airway  epithelial  cells.  Only  after  they  manually 
introduced  to  the  WT  a  key  mutation  previously  known  to  signif¬ 
icantly  affect  receptor  binding  (51)  did  serial  passages  up  to  22 
days  produce  strains  with  significantly  enhanced  growth  and  rep¬ 
lication  in  human  cells.  Sequencing  of  these  adapted  strains  re¬ 
vealed  two  additional  mutations  in  the  receptor-binding  domain 
that  arose  during  passages  (12).  This  observation  agrees  with  our 
conclusion  regarding  the  likelihood  of  species  jump-capable  mu¬ 
tations  spontaneously  arising  during  serial  passages:  with  physi¬ 
cally  reasonable  parameter  sets,  more  fit  genotypes  present  in  the 
sequence  space  requiring  mutations  up  to  a  d  of  2  aa  can  easily  be 
discovered  within  time  periods  of  ~20  days  (jumping  rate  /  =  5  X 
10~2  day-1  in  Fig.  5),  while  those  further  away  will  be  much  less 
likely  to  appear. 


FIG  1 1  Mean  frequencies  of  the  top  five  dominant  genotypes  as  functions  of 
time  from  the  influenza  virus  adaptation  simulations.  Up  to  103  realizations 
under  the  condition  of  Fig.  9  were  averaged.  Other  genotypes  not  shown  have 
mean  frequencies  less  than  1%. 


To  examine  more  specific  instances  of  such  adaptation  pro¬ 
cesses  with  direct  relevance  to  experimental  investigations,  we 
specialized  the  model  to  the  respiratory  infection  adaptation  of 
H5N1  under  serial  passages  and  simulated  the  recent  gain-of- 
function  experiments  by  Herfst  et  al.  (6)  using  our  stochastic  sim¬ 
ulation  algorithm.  In  this  experiment,  passaging  WT  H5N1  in 
ferrets  did  not  yield  adaptation,  but  after  three  mutations  previ¬ 
ously  known  to  play  important  roles  in  influenza  virus  infection 
were  introduced,  10  passage  rounds  produced  strains  capable  of 
respiratory  infection  between  ferrets.  Analyses  of  sequences  re¬ 
vealed  two  key  mutations  that  arose  during  passages.  The  large 
number  of  sequences  currently  available  for  influenza  allowed  us 
to  determine  fitness  landscapes  using  a  computational  fitness  in¬ 
ference  procedure.  By  choosing  different  sequence  sets  for  this 
inference  (H5  and  H3  in  Fig.  8A  and  B,  respectively),  we  obtained 
landscapes  specific  to  different  host  environments  to  which  a  viral 
population  adapts.  The  four  amino  acid  sites  within  HA  chosen 
for  investigation  (Hisl03,  Thrl56,  Gln222,  and  Gly224)  are  those 
previously  shown  to  exhibit  adaptive  mutations  enabling  respira¬ 
tory-route  infection  in  experiments  (6).  In  the  experiments,  with 
two  mutations  (Q222L/G224S)  preintroduced,  serial  passaging 
produced  adapted  strains  with  two  additional  mutations  (H103Y/ 
T156A).  The  first  two  mutations  did  not  feature  prominently  in 
the  H5  or  H3  fitness  landscapes  (Fig.  8),  reflecting  the  engineered 
nature  of  these  mutations  that  compensate  for  the  inefficiency  of 
H5  binding  to  mammalian  receptors  (46).  Our  choice  of  taking 
the  WT  and  MF  genotypes  in  our  simulation  as  HTQG  and  SAQG, 
respectively,  each  the  (near)  highest  fitness  sequences  in  H5  and 
H3  landscapes,  respectively,  mimics  the  situation  where  an  H5 
strain  undergoes  adaptation  with  neither  the  compensatory  mu¬ 
tations  Q222L/G224S  nor  the  help  of  collective  changes  to  other 
sites. 

The  evolutionary  path  described  in  Fig.  9  illustrates  an  example 
of  how  the  adaptation  process  could  unfold  in  the  sequence  space 
under  the  selective  forces  of  realistic  fitness  landscapes.  The  series 
of  changes  to  the  highest-frequency  genotypes  within  the  quasi¬ 
species  (HTQG  -»  HAQG  ->  RAQG  ->  LAQG  — >  SAQG)  as  well 
as  their  timings  reflect  both  the  features  of  the  local  landscape 
visited  and  the  mutational  constraints  imposed  by  the  genetic 
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code  (Fig.  10).  The  first  change  (T156A)  occurred  almost  imme¬ 
diately  (t  =  3  days),  driven  both  by  a  large  fitness  increase  (4.9  — > 
12.3)  and  by  the  proximity  of  Thr-  and  Ala-coding  codons.  The 
remaining  part  of  the  trajectory  and  the  time  evolution  of  mean 
frequencies  of  dominant  genotypes  (Fig.  11)  show  the  typical  dy¬ 
namical  patterns  of  adaptive  evolution  we  found  in  these  simula¬ 
tions:  fairly  long  periods  of  drift  in  which  the  master  sequence 
remains  unchanged,  while  the  group  of  minority  genotypes  keeps 
evolving  (the  quasispecies  “explores”  the  immediate  neighbor¬ 
hood  of  an  established  peak),  the  gradual  growth  in  frequency  of 
some  minor  variants,  one  of  which  takes  over  as  the  new  master 
sequence,  and  the  eventual  discovery  of  the  global  MF.  The  last 
event  occurred  only  in  10%  of  the  simulation  runs  within  ~300 
days  (Fig.  11),  but  if  it  did  occur,  the  MF  invariably  became  dom¬ 
inant.  The  series  of  amino  acid  changes  at  position  103  (H  — *  R  — ■» 
L  — »  S)  and  its  timing  in  Fig.  9  are  the  results  of  the  compromise 
between  fitness  gains  (Fig.  8B)  and  the  nucleotide  substitution 
required:  e.g.,  CAT  —>  CGT  — *  CTT  — *  TCT,  the  last  step  requiring 
two  substitutions,  explaining  the  rare  occurrence  of  SAQG  as  the 
master  sequence  within  the  timescale  of  ~  100  days  (Fig.  11). 

The  experimental  MF  genotypes  for  positions  103  and  156  are 
Tyr  and  Ala,  respectively.  The  genotype  YAQG  had  a  relatively  low 
fitness  in  our  inferred  landscape  (ranked  19  in  Fig.  8B),  and  its 
frequency  remained  low  in  the  simulations.  This  discrepancy  is 
primarily  due  to  the  inadequacy  of  the  sequence  data  underlying 
the  inferred  landscape,  underrepresenting  rare  compensatory 
mutations.  In  addition,  it  reflects  both  the  constraints  imposed  by 
the  fitness  gradient  (Fig.  8B)  and  the  genetic  code:  the  genotype 
YA  is  only  accessible  from  the  WT  (HT)  via  HA  but  involves  a 
fitness  decrease  (Fig.  10). 

We  may  infer  insights  into  a  better  understanding  of  evolu¬ 
tionary  dynamics  of  viral  pathogens  in  natural  settings  from  the 
study  of  serial  passages.  Systematic  investigation  of  the  effects  of 
factors  influencing  the  relative  ease  of  species  jump,  such  as  those 
shown  in  Fig.  6,  can  play  important  roles  in  such  interpretations. 
For  instance,  the  host  cell  number  and  bottleneck  size  roughly 
correspond  to  the  size  of  susceptible  host  populations  and  the 
degree  and  severity  of  selection  present  during  the  natural  spread 
of  a  viral  infection,  whereas  the  infection  and  death  rates  charac¬ 
terize  the  population  dynamics  within  this  host  environment.  In 
this  viewpoint,  the  longitudinal  patterns  of  population  size  and 
quasispecies  structure  shown  in  Fig.  3  can  be  regarded  as  idealiza¬ 
tions  of  what  typically  happens  in  natural  infectious  cycles:  a  viral 
strain  ventures  into  a  new  host  environment,  completing  multiple 
cycles  of  rapid  growth  followed  by  stagnation  due  to  host  deple¬ 
tion  and  reemergence  in  a  fresh  host  population,  but  with  factors 
characterizing  each  round  highly  variable  and  unpredictable 
rather  than  uniform  as  in  passages.  Our  study  of  evolutionary 
dynamics  under  passages  suggests  that  adaptation  mostly  occurs 
in  the  early  growth  phase  (Fig.  3),  an  adapting  viral  population 
may  frequently  get  stuck  without  a  direct  route  to  a  neighboring 
highly  fit  strain  (Fig.  4),  and  the  speed  of  adaptation  is  most  sen¬ 
sitive  to  the  distance  (Fig.  5)  and  target/bottleneck  sizes  (Fig.  6). 
The  specific  instance  of  adaptation  simulated  under  the  empiri¬ 
cally  derived  influenza  A  virus  fitness  landscape  (Fig.  8)  explicitly 
demonstrates  that  in  reality,  a  quasispecies  adaptation  involves  the 
stochastic  evolution  of  both  the  master  sequence  and  its  clouds 
(Fig.  9  and  11),  instead  of  a  simple  jump  from  the  WT  genotype  to 
a  global  MF  sequence. 

We  restricted  our  study  to  the  evolutionary  dynamics  of  small 


genomic  segments  on  the  order  of  a  few  amino  acids.  With  the 
empirical  inference  of  fitness  landscapes  from  MSAs,  however,  it  is 
conceivable  to  expand  the  scope  to  larger  domains,  possibly  con¬ 
taining  multiple  proteins,  with  the  direct  coupling  analysis  taking 
into  account  correlated  evolution  of  multiple  amino  acid  sites.  In 
addition,  in  modeling  influenza  virus  evolution  with  large 
genomic  segments,  an  extension  taking  into  account  reassortment 
events  that  play  important  roles  in  influenza  virus  adaptation  (7, 
26,  52)  could  provide  more  realistic  descriptions. 
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